Setting ups

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(plotly))

Part 1: Factor management

First we need to ensure which of the columns are factors. From the output of str function we can see here country and continent are factors. The number of levels in country is 142, and 5 for the continent.

gapminder %>% 
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gapminder$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Drop Oceania

Then we can apply some filter to the data frame followed by droplevels and see what happens. Here we see the unused levels of the factors are gone. 2 of the levels in country are gone, and the Oceania drops from continent. Via the filter, 24 rows of the data are also removed and the number of rows decreases from 1704 to 1680.

gap_wo_oc <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

gap_wo_oc %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gap_wo_oc$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"

Reorder the levels of country

By using fct_reorder we can easily reorder the levels in the factors.

big_pop <- gapminder %>% 
  filter(continent == "Americas") %>%
  group_by(country) %>%
  summarise(pop = pop[year == 2007]) %>%
  filter(pop > 1e7)

big_pop %>%
  mutate(country = fct_reorder(country, pop)) %>%
  ggplot(aes(pop, country)) +
  geom_point()

Explore the effects of arrange

By applying arrange, we can see that the order in the data frame actually changed, but there are no affect on the plots.

big_pop_ordered <- big_pop %>%
  arrange(pop)

knitr::kable(big_pop_ordered)
country pop
Cuba 11416987
Guatemala 12572928
Ecuador 13755680
Chile 16284741
Venezuela 26084662
Peru 28674757
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
big_pop_ordered %>%
  ggplot(aes(pop, country)) +
  geom_point()

So with fct_reorder and arrange now thr data frame is reordered in both tables and the plots.

big_pop_reordered <- big_pop %>%
  mutate(country = fct_reorder(country, pop)) %>%
  arrange(pop)

knitr::kable(big_pop_reordered)
country pop
Cuba 11416987
Guatemala 12572928
Ecuador 13755680
Chile 16284741
Venezuela 26084662
Peru 28674757
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
big_pop_reordered %>%
  ggplot(aes(pop, country)) +
  geom_point()

Part 2: File I/O

In this part we can use the previous big_pop data frame and do some file IO experiments. First we save the data frame using write_csv, write_tsv, saveRDS and dput.

big_pop_saved <- tail(big_pop_reordered, 6) %>%
  droplevels()

knitr::kable(big_pop_saved)
country pop
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
write_csv(big_pop_saved, "big_pop.csv")
write_tsv(big_pop_saved, "big_pop.tsv")
saveRDS(big_pop_saved, "big_pop.rds")
dput(big_pop_saved, "big_pop.txt")

Then we read them back and see what happens.

big_pop_csv <- read_csv("big_pop.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   pop = col_integer()
## )
knitr::kable(big_pop_csv)
country pop
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
big_pop_tsv <- read_tsv("big_pop.tsv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   pop = col_integer()
## )
knitr::kable(big_pop_tsv)
country pop
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
big_pop_rds <- readRDS("big_pop.rds")
knitr::kable(big_pop_rds)
country pop
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947
big_pop_txt <- dget("big_pop.txt")
knitr::kable(big_pop_txt)
country pop
Canada 33390141
Argentina 40301927
Colombia 44227550
Mexico 108700891
Brazil 190010647
United States 301139947

Part 3: Visualization design

The graph below shows where we start. It is not too bad, but with a little more working we can make it different.

gapminder %>% 
  ggplot(aes(gdpPercap, 
             lifeExp, 
             color = year)) + 
  geom_point() + 
  scale_x_log10()

Now here is what it looks like after some hard working. It is more readable, with more infos, and has a title now so people know what it is for.

(p <- 
  gapminder %>% 
  filter(continent != "Oceania") %>%
  ggplot(aes(gdpPercap, lifeExp)) + 
  geom_point(aes(color = year), alpha = 0.4) + 
  scale_x_log10(labels = dollar_format()) +
  scale_colour_distiller(palette = "Greens") +
  facet_wrap(~continent) + 
  scale_y_continuous(breaks = 10*1:10) +
  theme_bw() +
  labs(x = "Gdp per capita",
       y = "Life expectency",
       title = "Life expectency against gdp per capita of four continents"
       ) +
  theme(axis.text = element_text(size = 14),
        strip.background = element_rect(fill = "orange"),
        panel.background = element_rect(fill = "gray")
        )
)

Then we can trun this into a plotly graph. It allows more interactions with the graph compared with ggplot graph, including zoom in, select regions you want to see, or check the value of every point in the plot. It is cool and more suitable for presentations!

ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Part 4: Writing figures to file

By using ggsave we can save the plot to a local file.

ggsave("plot.png", p)
## Saving 7 x 5 in image

Then we can load it back from files in a html way.

We can actually save them in different format and different quality.

ggsave("plot_l.bmp", p, device = "bmp", height = 5, width = 5, dpi = 50)
ggsave("plot_h.bmp", p, device = "bmp", height = 5, width = 5, dpi = 200)

Then we can load them back and see the difference.